trec 2024
Auto-ARGUE: LLM-Based Report Generation Evaluation
Walden, William, Mason, Marc, Weller, Orion, Dietz, Laura, Conroy, John, Molino, Neil, Recknor, Hannah, Li, Bryan, Liu, Gabrielle Kaili-May, Hou, Yu, Lawrie, Dawn, Mayfield, James, Yang, Eugene
Generation of long-form, citation-backed reports is a primary use case for retrieval augmented generation (RAG) systems. While open-source evaluation tools exist for various RAG tasks, ones tailored to report generation (RG) are lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based implementation of the recently proposed ARGUE framework for RG evaluation.
Lessons from the TREC Plain Language Adaptation of Biomedical Abstracts (PLABA) track
Ondov, Brian, Xia, William, Attal, Kush, Unde, Ishita, He, Jerry, Demner-Fushman, Dina
Objective: Recent advances in language models have shown potential to adapt professional-facing biomedical literature to plain language, making it accessible to patients and caregivers. However, their unpredictability, combined with the high potential for harm in this domain, means rigorous evaluation is necessary. Our goals with this track were to stimulate research and to provide high-quality evaluation of the most promising systems. Methods: We hosted the Plain Language Adaptation of Biomedical Abstracts (PLABA) track at the 2023 and 2024 Text Retrieval Conferences. Tasks included complete, sentence-level, rewriting of abstracts (Task 1) as well as identifying and replacing difficult terms (Task 2). For automatic evaluation of Task 1, we developed a four-fold set of professionally-written references. Submissions for both Tasks 1 and 2 were provided extensive manual evaluation from biomedical experts. Results: Twelve teams spanning twelve countries participated in the track, with models from multilayer perceptrons to large pretrained transformers. In manual judgments of Task 1, top-performing models rivaled human levels of factual accuracy and completeness, but not simplicity or brevity. Automatic, reference-based metrics generally did not correlate well with manual judgments. In Task 2, systems struggled with identifying difficult terms and classifying how to replace them. When generating replacements, however, LLM-based systems did well in manually judged accuracy, completeness, and simplicity, though not in brevity. Conclusion: The PLABA track showed promise for using Large Language Models to adapt biomedical literature for the general public, while also highlighting their deficiencies and the need for improved automatic benchmarking tools.
Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges
Thakur, Nandan, Pradeep, Ronak, Upadhyay, Shivani, Campos, Daniel, Craswell, Nick, Lin, Jimmy
Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing "ground truth", thereby reducing system hallucinations. A crucial factor in RAG evaluation is "support", whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
UM_FHS at TREC 2024 PLABA: Exploration of Fine-tuning and AI agent approach for plain language adaptations of biomedical text
Kocbek, Primoz, Kopitar, Leon, Zhang, Zhihong, Aydin, Emirhan, Topaz, Maxim, Stiglic, Gregor
This paper describes our submissions to the TREC 2024 PLABA track with the aim to simplify biomedical abstracts for a K8 - level audience (13 - 14 years old students). We tested three approaches using OpenAI's gpt - 4o and gpt - 4o - mini models: baseline prompt engineering, a two - AI agent approach, and fine - tuning. Adaptations were evaluated using qualitative metrics ( 5 - point Likert scales for simplicity, accuracy, completeness, and brevity) and quantitative readability scores (Flesch - Kincaid grade level, SMOG Index). Results indicate d that the two - agent approach and baseline prompt engineering with gpt - 4o - mini models show superior qualitative performance, while fine - tuned models excelled in accuracy and completeness but were less simple. The evaluation results demonstrated that prompt engineering with gpt - 4o - mini outperforms iterative improvement strategies via two - agent approach as well as fine - tuning with gpt - 4o. We intend to expand our investigation of the results and explore advanced evaluations.
A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
Upadhyay, Shivani, Pradeep, Ronak, Thakur, Nandan, Campos, Daniel, Craswell, Nick, Soboroff, Ian, Dang, Hoa Trang, Lin, Jimmy
The application of large language models to provide relevance assessments presents exciting opportunities to advance information retrieval, natural language processing, and beyond, but to date many unknowns remain. This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits. Overall, human assessors appear to be stricter than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
Ragnar\"ok: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track
Pradeep, Ronak, Thakur, Nandan, Sharifymoghaddam, Sahel, Zhang, Eric, Nguyen, Ryan, Campos, Daniel, Craswell, Nick, Lin, Jimmy
Did you try out the new Bing Search? Or maybe you fiddled around with Google AI~Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've made towards making this track a reality -- we describe the details of our reusable framework, Ragnar\"ok, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnar\"ok, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena allowing benchmarking pairwise RAG systems by crowdsourcing. We open-source our Ragnar\"ok framework and baselines to achieve a unified standard for future RAG systems.